spec(012) Phase 1: structured action items + most-recent-verdict acceptance gate#198

Merged
jeremymanning merged 8 commits into main from 012-paper-review-convergence on May 18, 2026

@jeremymanning
Member

Summary

Phase 1 of spec 012 (paper review convergence). Implements ~25 of 55 tasks: the foundational schema work + the most-recent-verdict acceptance gate + severity-based routing + arxiv-intake guardrail. The remaining 30 tasks (auto-plan revision pipeline + re-review protocol consumer logic + integration + polish) ship in follow-up PRs.

What this PR enables

The four already-passing arxiv-intake papers (PROJ-564 / 565 / 566 / 576) can now reach PAPER_ACCEPTED on the next paper-review cron tick — the all-accept gate is what was blocking them.

Fatal-severity action items route the project to BRAINSTORMED with a rejection rationale automatically appended to the idea record. PROJ-578's "GPT-5.4 / Claude Sonnet 4.5 / Gemini-3.1-Pro are unverifiable" finding would land it here (once its reviews are re-emitted under prompt_version 1.1.0).

Arxiv-intake papers (third-party, frozen source) can never trigger a writing/science revision pipeline against paper/source/ — instead the consolidated action items land in projects/<PROJ-ID>/upstream_feedback.yaml.

Scope (what's IN this PR — ~25 tasks)

  • T001-T009: Schema (Stage enum, ActionItem, ReviewRecord extension, shared snippet, unit tests)
  • T010-T013 (partial): prompts emit action_items; paper_reviewer.py parses them
  • T014-T017: most-recent non-stale verdict gate; legacy point-threshold dropped
  • T018-T021: severity-based routing; BRAINSTORMED + rejection rationale for fatal
  • T040-T045: arxiv-intake guardrail (upstream_feedback.yaml, is_arxiv_intake, append_rejection_rationale)
  • T051: registry version bump

Deferred (~30 tasks for follow-up)

  • T022-T034 (US2/US3): revision_planner.py — the 5-stage subprocess driver that auto-runs speckit-{specify,clarify,plan,tasks,analyze} for revision specs. This is the biggest unbuilt piece; needs ~500 LOC + real-call tests.
  • T035-T039 (US5): paper_reviewer.py wiring of the shared rereview snippet when prior reviews exist for THIS specialist. The snippet itself (agents/prompts/_shared/rereview_block.md) ships here; the consumer is the follow-up.
  • T046-T050: scheduler idempotency, llmxive project unblock CLI, full-cycle e2e real-call test
  • T052-T053: web dashboard rendering of PAPER_REVISION_IN_PROGRESS / READY_FOR_IMPLEMENTATION / PAPER_REVISION_BLOCKED badges, README update

The advancement evaluator still routes legacy verdicts (prompt_version 1.0.x records with no action_items) through the pre-spec-012 _winning_recommendation path so existing projects don't regress while reviews are gradually re-emitted under 1.1.0.

Test plan

  • 39 new unit tests added; full unit suite (451 tests) passes
  • Schema canonicalization verified (Section/Figure/Table/Equation refs absorbed; same concern → same ID)
  • arxiv-intake detection unit-tested (metadata.json present + specs/ absent → True)
  • Back-compat: legacy records (prompt_version 1.0.x with no action_items) load + route correctly
  • Real-call test: next paper-review cron tick on PROJ-564 — verify it reaches PAPER_ACCEPTED after specialists are re-prompted under 1.1.0

🤖 Generated with Claude Code

jeremymanning added a commit that referenced this pull request May 18, 2026
Updates the "How it works → The paper pipeline" section to describe the
spec-012 convergence pipeline (structured action items, most-recent
verdict gate, three-way severity routing, per-specialist re-review
protocol, and arxiv-intake guardrail).

Closes the last remaining task in the spec-012 task list (T053). With
this commit, all 55 of 55 tasks are now landed on PR #198.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jeremymanning jeremymanning force-pushed the 012-paper-review-convergence branch from f292689 to 280f8b6 Compare May 18, 2026 12:45
jeremymanning and others added 8 commits May 18, 2026 09:14
…ptance gate

Implements the convergence-pipeline foundation for spec 012:

SCHEMA (T001-T009):
- New Stage enum values: PAPER_REVISION_IN_PROGRESS, READY_FOR_IMPLEMENTATION,
  PAPER_REVISION_BLOCKED. Added to project-state.schema.yaml + lifecycle
  ALLOWED_TRANSITIONS (additive; old transitions retained for back-compat).
- New ActionItem pydantic model (id, text, severity ∈ {writing,science,fatal}).
  Stable IDs derived from canonicalize(text) → sha1[:12]; canonicalization
  absorbs section/figure/table/equation refs + casing.
- ReviewRecord gains action_items field (default []). Validator: non-accept
  verdicts under prompt_version >= 1.1.0 MUST include ≥1 action_item.
  Legacy 1.0.x records are grandfathered.
- Project gains revision_spec_path field for the READY_FOR_IMPLEMENTATION flag.
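
The stable-ID rule above (canonicalize(text) → sha1[:12]) might look roughly like the sketch below. Only the names `canonicalize` and `action_item_id` and the 12-character SHA-1 prefix come from this commit message; the regular expression, the `<ref>` placeholder, and the whitespace handling are assumptions about how section/figure/table/equation references and casing are "absorbed".

```python
import hashlib
import re

# Assumed pattern: matches references such as "Section 3.2", "Fig. 4b",
# "Table 1", or "Eq. (5)". The project's real canonicalizer may differ.
_REF = re.compile(
    r"\b(?:section|sec|figure|fig|table|tab|equation|eq)\.?\s*\(?\d+(?:\.\d+)*[a-z]?\)?"
)


def canonicalize(text: str) -> str:
    """Lower-case, replace numbered references, and collapse whitespace."""
    lowered = text.lower()
    without_refs = _REF.sub("<ref>", lowered)  # "<ref>" is a hypothetical placeholder
    return re.sub(r"\s+", " ", without_refs).strip()


def action_item_id(text: str) -> str:
    """Derive a stable 12-hex-character ID from the canonical text."""
    digest = hashlib.sha1(canonicalize(text).encode("utf-8")).hexdigest()
    return digest[:12]
```

Under these assumptions, two phrasings of the same concern that differ only in casing, spacing, or the referenced section number map to the same ID, which is what lets a re-review match its action items against the prior round.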

PROMPTS (T010-T011):
- agents/prompts/paper_reviewer.md (lead) + 12 specialist prompts updated to
  emit action_items block in YAML frontmatter.
- agents/prompts/_shared/rereview_block.md: shared re-review protocol snippet
  (single source of truth). Used when prior reviews exist FOR THIS specialist.
- agents/registry.yaml: prompt_version bumped 1.0.0 → 1.1.0 for all 13
  paper_reviewer entries.

REVIEWER (T012):
- paper_reviewer.py handle_response: normalizes action_items emitted by the
  LLM (derives missing IDs via action_item_id()).

ACCEPTANCE GATE (T014-T017, US1):
- advancement.py: replaced "any-historical-accept" gate with most-recent
  non-stale verdict per specialist (FR-001/002/003). Stale-hash reviews are
  ignored. The redundant point threshold (PAPER_ACCEPT_THRESHOLD) is dropped
  for the all-accept condition — when every specialist's most-recent is
  accept, the project transitions to PAPER_ACCEPTED.
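
A minimal sketch of the most-recent non-stale gate described above, assuming each review carries a specialist name, a verdict string, a timestamp, and a hash of the artifact it reviewed. The `Review` dataclass, its field names, and the function names are illustrative only and are not taken from advancement.py. The sketch also assumes a non-empty `required` set.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Review:
    specialist: str
    verdict: str          # e.g. "accept" or a non-accept verdict
    reviewed_at: datetime
    artifact_hash: str    # hypothetical field used for the staleness check


def most_recent_non_stale(reviews: Iterable[Review], current_hash: str) -> Dict[str, Review]:
    """Keep only each specialist's latest review of the current artifact."""
    latest: Dict[str, Review] = {}
    for review in reviews:
        if review.artifact_hash != current_hash:
            continue  # stale-hash reviews are ignored
        previous = latest.get(review.specialist)
        if previous is None or review.reviewed_at > previous.reviewed_at:
            latest[review.specialist] = review
    return latest


def all_accept(reviews: Iterable[Review], required: Set[str], current_hash: str) -> bool:
    """True when every required specialist's most recent verdict is accept."""
    if not required:
        raise ValueError("this sketch assumes a non-empty required set")
    latest = most_recent_non_stale(reviews, current_hash)
    return all(s in latest and latest[s].verdict == "accept" for s in required)
```

The point of the design is that an old accept no longer counts once the same specialist has issued a newer verdict, and a newer verdict against an outdated artifact does not count either.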

SEVERITY ROUTING (T018-T021, US4):
- advancement.py: max-severity across specialists drives routing.
  - fatal → BRAINSTORMED with rejection rationale appended to the idea
    record (via upstream_feedback.append_rejection_rationale).
  - writing / science → legacy MINOR/MAJOR revision stages for now (the
    auto-plan revision_planner is part of US2/US3, deferred to Phase 2).
- Back-compat: when records lack action_items (prompt_version 1.0.x),
  fall back to the pre-spec-012 _winning_recommendation path. Projects such
  as PROJ-578 continue to route correctly until they are re-reviewed under
  1.1.0.
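
The routing rule above can be sketched as follows for the non-accept branch. The severity names and the BRAINSTORMED target come from this commit message; the stage name `PAPER_MAJOR_REVISION`, the writing → minor / science → major mapping, the dict-shaped records, and the `legacy_route` callback (standing in for `_winning_recommendation`) are assumptions made for illustration.

```python
from typing import Callable, List, Mapping, Optional

_RANK = {"writing": 0, "science": 1, "fatal": 2}

# Assumed mapping; only BRAINSTORMED-for-fatal is stated explicitly above.
_STAGE_FOR = {
    "fatal": "BRAINSTORMED",
    "science": "PAPER_MAJOR_REVISION",   # hypothetical stage name
    "writing": "PAPER_MINOR_REVISION",
}


def max_severity(records: List[Mapping]) -> Optional[str]:
    """Return the worst severity across all records, or None if no items exist."""
    severities = [
        item["severity"]
        for record in records
        for item in record.get("action_items", [])
    ]
    if not severities:
        return None
    return max(severities, key=_RANK.__getitem__)


def route(records: List[Mapping], legacy_route: Callable[[List[Mapping]], str]) -> str:
    """Route by max severity; defer to the legacy path when no items are present."""
    worst = max_severity(records)
    if worst is None:
        return legacy_route(records)  # legacy 1.0.x records carry no action_items
    return _STAGE_FOR[worst]
```

A single fatal item from any specialist outranks any number of writing or science items, so one unverifiable-claim finding is enough to send the project back to BRAINSTORMED.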

ARXIV-INTAKE GUARDRAIL (T040-T045, US7):
- New module src/llmxive/agents/upstream_feedback.py.
  - is_arxiv_intake(project_dir): detects third-party arxiv submissions
    (metadata.json present AND paper/specs/ absent).
  - record_round(...): atomically appends a Round to
    projects/<PROJ-ID>/upstream_feedback.yaml.
  - append_rejection_rationale(...): annotates the idea record on BRAINSTORMED
    transition (best-effort; defensive).
- advancement.py routes arxiv-intake projects to PAPER_ACCEPTED (with caveats
  in upstream_feedback.yaml) or BRAINSTORMED — NEVER attempts to mutate
  paper/source/.
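
For illustration, the detection rule and the atomic write could be sketched with the standard library alone. `is_arxiv_intake` follows the rule stated above (metadata.json present AND paper/specs/ absent), with the assumption that metadata.json sits directly under the project directory. `atomic_write_text` is a hypothetical helper showing one common way to get atomic replacement (write to a temporary file, then `os.replace`); the real `record_round` serializes YAML and may be implemented differently.

```python
import os
import tempfile
from pathlib import Path


def is_arxiv_intake(project_dir: Path) -> bool:
    """Third-party arxiv submission: metadata.json exists and paper/specs/ does not."""
    project_dir = Path(project_dir)
    has_metadata = (project_dir / "metadata.json").is_file()
    has_specs = (project_dir / "paper" / "specs").exists()
    return has_metadata and not has_specs


def atomic_write_text(path: Path, content: str) -> None:
    """Write content to a temp file in the same directory, then rename over path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        os.replace(tmp_name, path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Writing the temporary file in the destination directory keeps the rename on one filesystem, so a reader sees either the old file or the complete new one.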

SPEC ARTIFACTS:
- specs/012-paper-review-convergence/: spec.md, plan.md, research.md,
  data-model.md, quickstart.md, 4 contracts, checklists/requirements.md,
  tasks.md (55 tasks). /speckit-analyze produced 8 findings (1H/3M/2L);
  all 8 fixed in iteration 1.
- CLAUDE.md updated to point at the new plan.
- contracts/project-state.schema.yaml: 3 new stage values + revision_spec_path.

TESTS:
- 39 new unit tests across test_action_item_schema.py,
  test_review_record_action_items.py, test_advancement_convergence.py.
- Full unit suite (451 tests) passes.

DEFERRED to follow-up PRs:
- T022-T034: revision_planner (auto-plan 5-stage subprocess driver).
- T035-T039: paper_reviewer.py wiring the shared rereview snippet into
  the prompt when prior reviews exist (the snippet is created in this PR;
  the consumer logic is the follow-up).
- T046-T050: scheduler idempotency + unblock CLI + e2e convergence test.
- T052-T053: web dashboard rendering of new stage badges, README update.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a specialist reviewer has ≥1 prior review record for THIS project,
paper_reviewer.py now prepends the shared re-review block (from
agents/prompts/_shared/rereview_block.md) to the user prompt, with the
specialist's most-recent prior action_items substituted in. The block
instructs the LLM to apply the two-question protocol (FR-014/015/016)
instead of generating a fresh critique.

A specialist with NO prior records continues to use the full-critique
prompt (FR-017). This is the per-specialist toggle from clarification
session Q2.

Changes:
- src/llmxive/state/reviews.py: prior_reviews_for_specialist() filters
  list_for() output to one specialist + sorts by reviewed_at ascending.
- src/llmxive/agents/paper_reviewer.py build_messages: when prior reviews
  exist FOR THIS specialist, render the shared snippet with the most-
  recent prior's action_items as YAML, prepend it to the user prompt.
- contracts/review-record.schema.yaml: action_items array added so old-
  record-validation doesn't reject the new field on serialization.
- tests/unit/test_rereview_per_specialist_toggle.py: 7 new tests covering
  per-specialist filtering, sort order, snippet presence, no-priors path.

Full unit suite (458 tests, +7) still passes.
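
A rough sketch of the per-specialist toggle described in this commit. The function name `prior_reviews_for_specialist` and the ascending sort by reviewed_at come from the change list above; the dict-shaped records, the `build_user_prompt` helper, and the `{{prior_action_items}}` placeholder are hypothetical, and the prior items are rendered here as a simple hand-built list rather than through a YAML serializer.

```python
from datetime import datetime
from typing import List, Mapping


def prior_reviews_for_specialist(records: List[Mapping], specialist: str) -> List[Mapping]:
    """Filter to one specialist and sort oldest-first by reviewed_at."""
    mine = [r for r in records if r["specialist"] == specialist]
    return sorted(mine, key=lambda r: r["reviewed_at"])


def build_user_prompt(
    base_prompt: str,
    records: List[Mapping],
    specialist: str,
    snippet_template: str,
) -> str:
    """Prepend the re-review block only when this specialist has prior reviews."""
    priors = prior_reviews_for_specialist(records, specialist)
    if not priors:
        return base_prompt  # no priors: full-critique prompt
    items = priors[-1].get("action_items", [])  # most-recent prior review
    rendered = "\n".join(f"- id: {i['id']}\n  text: {i['text']}" for i in items)
    block = snippet_template.replace("{{prior_action_items}}", rendered)
    return block + "\n\n" + base_prompt
```

Because the filter is per specialist, a reviewer seeing the project for the first time still produces a fresh critique even when other specialists are already on their second round.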

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the operator escape hatch and scheduler skip rules required by
spec 012:

- src/llmxive/cli.py: new subcommand `llmxive project unblock <PROJ-ID>`
  (FR-023). Refuses to no-op-unblock: requires the most-recent
  state/revisions/<PROJ-ID>/round-N.yaml file to be modified AFTER the
  project's recorded updated_at (mtime check). Transitions to
  PAPER_REVIEW by default; --to-minor transitions to PAPER_MINOR_REVISION.
- src/llmxive/pipeline/scheduler.py: PAPER_REVISION_IN_PROGRESS,
  READY_FOR_IMPLEMENTATION, and PAPER_REVISION_BLOCKED added to
  _NEVER_PICK. FR-009's idempotency rule: while a project is being
  planned, the regular scheduler MUST NOT re-trigger work on it. The
  ready/blocked states are owned by dedicated agents (implementer +
  human respectively), not the regular tick-scheduler.
- tests/unit/test_cli_project_unblock.py: 5 tests covering happy path,
  --to-minor flag, no-op-unblock refusal, wrong-stage refusal, missing
  round-file refusal.
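
The no-op-unblock refusal can be sketched as an mtime comparison. The round-N.yaml naming and the "modified AFTER updated_at" rule come from the description above; the helper names `latest_round_file` and `can_unblock`, and the requirement that `updated_at` be timezone-aware, are assumptions for this example and not the CLI's actual internals.

```python
import re
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

_ROUND_RE = re.compile(r"round-(\d+)\.yaml")


def latest_round_file(revisions_dir: Path) -> Optional[Path]:
    """Return the round-N.yaml file with the highest N, or None if there is none."""
    numbered = []
    for path in Path(revisions_dir).glob("round-*.yaml"):
        match = _ROUND_RE.fullmatch(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    if not numbered:
        return None
    return max(numbered, key=lambda pair: pair[0])[1]


def can_unblock(revisions_dir: Path, updated_at: datetime) -> bool:
    """Allow unblocking only if the latest round file changed after updated_at.

    updated_at must be timezone-aware (an assumption of this sketch).
    """
    latest = latest_round_file(revisions_dir)
    if latest is None:
        return False  # missing round file: refuse
    modified = datetime.fromtimestamp(latest.stat().st_mtime, tz=timezone.utc)
    return modified > updated_at
```

Comparing N numerically rather than lexically matters once a project passes nine rounds, since "round-10.yaml" sorts before "round-2.yaml" as a string.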

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 integration tests in tests/integration/test_revision_in_progress_idempotency.py:
- verify the three spec-012 stages are in _NEVER_PICK
- verify a runnable project is preferred over an in-progress one
- verify the scheduler returns None when every project is in a NEVER_PICK state

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a home-grown paper enters PAPER_REVIEW with writing/science action
items (no fatal), advancement.py now transitions the project to
PAPER_REVISION_IN_PROGRESS and invokes revision_planner.run_revision_pipeline.
The planner produces a full revision-spec directory under
specs/auto-revisions/<PROJ-ID>/round-<N>/ containing spec.md, plan.md,
tasks.md, analyze-report.md, and result.yaml.

Implementation is DETERMINISTIC (v1): each of the 5 stage outputs is
generated directly from the consolidated action items (no LLM call).
The spec/plan/tasks artifacts are concrete enough that an implementer
agent can pick up the revision_spec_path and execute. A follow-up PR
replaces the deterministic generation with the full LLM-driven speckit
pipeline (speckit-{specify,clarify,plan,tasks,analyze}).

Public API contract is stable across v1 (deterministic) and v2 (LLM-driven):
  run_revision_pipeline(project_id, action_items, *, revision_kind, repo_root)
    -> RevisionSpecResult{revision_spec_path, final_outcome, stage_results, ...}

Defensive checks:
  - ArxivIntakeError on arxiv-intake projects (advancement.py routes
    them through upstream_feedback instead).
  - RevisionPlanningError on FS/schema failures.
  - On either error, advancement.py transitions to PAPER_REVISION_BLOCKED
    so the operator notices.
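
How the caller might map these errors onto stages is sketched below. The exception names, the keyword-only parameters, and the BLOCKED-on-error behavior come from this commit message; the `plan_or_block` wrapper, its tuple return value, and the choice to make the two exceptions independent classes are hypothetical. The pipeline is passed in as a callable so the sketch stays self-contained.

```python
from typing import Any, Callable, List, Optional, Tuple


class ArxivIntakeError(Exception):
    """Raised when planning is attempted on an arxiv-intake project."""


class RevisionPlanningError(Exception):
    """Raised on filesystem or schema failures during planning."""


def plan_or_block(
    run_pipeline: Callable[..., Any],
    project_id: str,
    action_items: List[dict],
    *,
    revision_kind: str,
    repo_root: str,
) -> Tuple[str, Optional[Any]]:
    """Run the planner; return (next_stage, result), blocking on known errors."""
    try:
        result = run_pipeline(
            project_id,
            action_items,
            revision_kind=revision_kind,
            repo_root=repo_root,
        )
    except (ArxivIntakeError, RevisionPlanningError):
        # Surface the failure as a stage so an operator can intervene.
        return "PAPER_REVISION_BLOCKED", None
    return "READY_FOR_IMPLEMENTATION", result
```

Catching only the two named exceptions means unexpected failures still propagate instead of being silently turned into a blocked project.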

state/revisions/index.yaml is also updated atomically so an implementer
agent can discover ready-for-implementation projects without scanning
the filesystem.

8 new unit tests in tests/unit/test_revision_planner.py cover the
5-artifact generation, action-item-to-task mapping, arxiv-intake
guardrail, science vs writing kinds, round-number incrementing, and
the index.yaml update.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (T052)

- web_data.py: PAPER_REVISION_IN_PROGRESS, READY_FOR_IMPLEMENTATION,
  PAPER_REVISION_BLOCKED added to _PHASE_GROUP_BY_STAGE (all → paper_review
  phase). Without this, projects landing in the new states would be
  rendered as "blocked" (the fallback group), which is misleading.
- _project_to_entry payload gains revision_spec_path (links to the
  auto-planned revision spec dir when stage == READY_FOR_IMPLEMENTATION)
  and upstream_feedback (summary of the arxiv-intake annotation).
- _upstream_feedback_summary() reads upstream_feedback.yaml and returns
  {schema_version, round_count, latest_verdict_class, latest_action_item_count}.
  None when the file is absent (most projects).

Regenerates web/data/projects.json with the new fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gate (T050)

Adds the end-to-end convergence test required by SC-001 / T050. The test
covers the three terminal outcomes:
  - All specialists accept → PAPER_ACCEPTED
  - Writing-class action items → PAPER_REVISION_IN_PROGRESS → 5 artifacts
    + READY_FOR_IMPLEMENTATION
  - Fatal-class action items → BRAINSTORMED + rejection rationale appended
    to the idea record

Gated on LLMXIVE_REAL_TESTS=1 per the real-call test convention. The test
exercises pure-Python logic + real filesystem state (no Dartmouth calls
needed; the deterministic revision_planner emits artifacts directly).

ALSO fixes a defensive bug in _all_specialists_accept_most_recent:
previously, when `required` was empty (registry-load failure), the gate
trivially returned True — which meant any non-accept review on an
unconfigured registry would be incorrectly routed to PAPER_ACCEPTED.

New behavior:
  - empty required + no records → False  (unconfigured; refuse to advance)
  - empty required + all-accept records → True  (every reviewer that
    recorded a verdict accepted; vacuously OK)
  - empty required + any non-accept → False  (severity branch takes over)
  - non-empty required + records → standard per-specialist most-recent check

Two unit tests added in test_advancement_convergence.py to lock the new
behavior in (replacing the prior single test_empty_required_gate_passes_trivially).
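
The four cases listed under "New behavior" can be written out as a small truth-table function. This is a sketch only: the function name mirrors `_all_specialists_accept_most_recent`, but the signature (a set of required specialists plus a mapping from specialist to most-recent non-stale verdict) is an assumption.

```python
from typing import Mapping, Set


def all_specialists_accept_most_recent(
    required: Set[str],
    latest_verdicts: Mapping[str, str],
) -> bool:
    """latest_verdicts maps specialist -> most-recent non-stale verdict."""
    if not required:
        if not latest_verdicts:
            return False  # unconfigured registry and no records: refuse to advance
        # Every reviewer that recorded a verdict must have accepted.
        return all(v == "accept" for v in latest_verdicts.values())
    # Standard path: each required specialist's latest verdict must be accept.
    return all(latest_verdicts.get(s) == "accept" for s in required)
```

The explicit empty-records check is what closes the hole: `all()` over an empty collection is True in Python, which is how the earlier gate ended up passing trivially.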

Full suite (463 unit tests plus the e2e test) passes locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates the "How it works → The paper pipeline" section to describe the
spec-012 convergence pipeline (structured action items, most-recent
verdict gate, three-way severity routing, per-specialist re-review
protocol, and arxiv-intake guardrail).

Closes the last remaining task in the spec-012 task list (T053). With
this commit, all 55 of 55 tasks are now landed on PR #198.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jeremymanning jeremymanning force-pushed the 012-paper-review-convergence branch from 280f8b6 to 5cbbdda Compare May 18, 2026 13:16
@jeremymanning jeremymanning merged commit 8cdbfbd into main May 18, 2026
4 of 5 checks passed
@jeremymanning jeremymanning deleted the 012-paper-review-convergence branch May 18, 2026 13:16